Noise Elimination from the Web Documents by Using URL Paths and Information Redundancy
Authors
Abstract
Noise data in Web documents degrades the performance of Web information management systems. Many researchers have proposed noise-elimination methods based on document structure. In this paper, we propose a different approach that eliminates redundant information shared by Web documents retrieved from the same URL path. We propose a redundant word/phrase filtering method for single or multiple tokenizations, and we conducted two experiments to examine the efficiency and effectiveness of our filtering approaches. The experimental results show that our approach performs well on both criteria.
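The abstract does not spell out the filtering algorithm, so the following is only a minimal sketch of one plausible reading of "redundant word/phrase filtering": phrases that recur across most pages fetched from the same URL path are treated as template noise and stripped. The function names and the `min_share` threshold are hypothetical, not taken from the paper.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous token sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def filter_redundant_phrases(docs, max_n=3, min_share=0.8):
    """Remove word/phrase sequences that recur in at least min_share of
    the documents (e.g. pages crawled from the same URL path).

    docs: list of token lists, one per page.
    Returns the documents with the redundant phrases removed.
    """
    # Document frequency of every phrase of 1..max_n tokens.
    df = Counter()
    for tokens in docs:
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        df.update(seen)

    threshold = min_share * len(docs)
    redundant = {phrase for phrase, count in df.items() if count >= threshold}

    cleaned = []
    for tokens in docs:
        out, i = [], 0
        while i < len(tokens):
            # Greedily skip the longest redundant phrase starting at i.
            for n in range(max_n, 0, -1):
                if tuple(tokens[i:i + n]) in redundant:
                    i += n
                    break
            else:
                out.append(tokens[i])
                i += 1
        cleaned.append(out)
    return cleaned
```

With three pages that share a navigation header, the shared tokens are filtered out while page-specific text survives, which is the behavior the abstract attributes to redundancy elimination.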
Similar references
Impulsive Noise Elimination Considering the Bit Planes Information of the Image
Impulsive noise is one of the defects that degrades the quality of images. The performance of many image processing applications depends directly on the quality of the input image. Hence, it is necessary to de-noise degraded images without losing their valuable information, such as edges. In this paper we propose a method to remove impulsive noise from color images without damaging the ima...
Semantic web access prediction using WordNet
The user-observed latency of retrieving Web documents is one of the limiting factors in using the Internet as an information source. Prefetching has become an important technique for reducing the average Web access latency. Existing prefetching methods are based predominantly on URL graphs; they use the graphical nature of HTTP links to determine the possible paths through a hypertext system. Althoug...
WebClass: Web Document Classification Using Modified Decision Trees
Searching for Web sites is one of the most common tasks performed on the Web. Web page classification is the first step in constructing a Web search service. This paper proposes a system, named WebClass, for classifying Web documents by using a height-three modified decision tree which splits the root, depth-one nodes, and depth-two nodes on the keywords, descriptions, and hyperlinks, respectively. ...
A Research on Web Content Extraction and Noise Reduction through Text Density Using Malicious URL Pattern Detection
A Web page contains a large amount of information, including additional content such as hyperlinks, headers, footers, navigation panels, and advertisements, which can complicate content extraction. Page segmentation is used to detect noisy content blocks by detecting malicious URLs in Web pages. The main aim of this research is to detect malicious URLs during content extraction by checking ...
Noise-tolerance feasibility for restricted-domain Information Retrieval systems
Information Retrieval systems normally have to work with rather heterogeneous sources, such as Web sites or documents produced by Optical Character Recognition tools. The correct conversion of these sources into flat text files is not a trivial task, since noise may easily be introduced through spelling or typesetting errors. Interestingly, this is not a great drawback when the size of the corpus is...